In this project I analyze the World Happiness dataset, which covers 158 countries and a number of attributes. I took the dataset from Kaggle: https://www.kaggle.com/unsdsn/world-happiness/home
I use “Import Dataset” in the Global Environment pane to load the dataset, then convert it to a data frame.
library(kableExtra)
library(knitr)
library(readr)
X2015 <- read_csv("~/Desktop/world-happiness-report/2015.csv")
happy_2015 <- as.data.frame(X2015)
I use a few functions to get a first look at the dataset and understand it. I start by checking the column (attribute) names, to see whether I should make any changes before exploring further. I like to rename columns so that they are short and meaningful, which makes them easier to refer to later.
#Viewing column names
names(happy_2015)
## [1] "Country" "Region"
## [3] "Happiness Rank" "Happiness Score"
## [5] "Standard Error" "Economy (GDP per Capita)"
## [7] "Family" "Health (Life Expectancy)"
## [9] "Freedom" "Trust (Government Corruption)"
## [11] "Generosity" "Dystopia Residual"
It is better to rename all columns so that none of the names contains white space. It is also good to note the original names in the metadata, in case a new attribute name is not clear enough. I keep them just to be on the safe side :)
#Changing column names
names(happy_2015) <- c("country", "region", "happiness_rank", "happiness_score", "std_error", "gdp_per_cpt", "family", "life_exp", "freedom", "trust_corruption", "generosity", "dystopia_residual")
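One way to keep the original names on record, as suggested above, is a small lookup table built before the names are overwritten. This is only a sketch on a toy data frame: `toy`, `new_names` and `name_map` are illustrative names that are not part of the original analysis, and only three of the twelve columns are shown.

```r
# Toy data frame with three of the original column names
# (values copied from the first two rows of the 2015 data)
toy <- data.frame(
  "Country" = c("Switzerland", "Iceland"),
  "Happiness Score" = c(7.587, 7.561),
  "Economy (GDP per Capita)" = c(1.39651, 1.30232),
  check.names = FALSE,       # keep the white space in the names
  stringsAsFactors = FALSE
)

new_names <- c("country", "happiness_score", "gdp_per_cpt")

# Record the mapping before overwriting the names
name_map <- data.frame(
  original = names(toy),
  renamed = new_names,
  stringsAsFactors = FALSE
)

names(toy) <- name_map$renamed
print(name_map)
```

The lookup table can be saved next to the analysis, so that a short name such as `gdp_per_cpt` can always be traced back to its original label.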
Now we can start exploring the dataset. I will look at the head of the data set to see the happiest countries.
kable(head(happy_2015[, c("country", "region", "happiness_rank", "happiness_score")]))%>%
kable_styling(bootstrap_options = c("striped", "hover"))
| country | region | happiness_rank | happiness_score |
|---|---|---|---|
| Switzerland | Western Europe | 1 | 7.587 |
| Iceland | Western Europe | 2 | 7.561 |
| Denmark | Western Europe | 3 | 7.527 |
| Norway | Western Europe | 4 | 7.522 |
| Canada | North America | 5 | 7.427 |
| Finland | Western Europe | 6 | 7.406 |
I run initial summary statistics to get a general idea about the observations. I also look at the total number of missing values in the data.
#Viewing stats about each attribute
summary(happy_2015)
#structure
str(happy_2015)
#missing values
sum(is.na(happy_2015))
library(fBasics) #library for the summary table
#subsetting the dataset to only include numeric values
num_hap <- happy_2015[, c("happiness_rank", "happiness_score", "std_error", "gdp_per_cpt", "family", "life_exp", "freedom", "trust_corruption", "generosity", "dystopia_residual")]
#subsetting the summary table to view only stats below
basicStats(num_hap)[c("Mean", "Stdev", "Median", "Minimum", "Maximum"),]
## happiness_rank happiness_score std_error gdp_per_cpt family
## Mean 79.49367 5.375734 0.047885 0.846137 0.991046
## Stdev 45.75436 1.145010 0.017146 0.403121 0.272369
## Median 79.50000 5.232500 0.043940 0.910245 1.029510
## Minimum 1.00000 2.839000 0.018480 0.000000 0.000000
## Maximum 158.00000 7.587000 0.136930 1.690420 1.402230
## life_exp freedom trust_corruption generosity dystopia_residual
## Mean 0.630259 0.428615 0.143422 0.237296 2.098977
## Stdev 0.247078 0.150693 0.120034 0.126685 0.553550
## Median 0.696705 0.435515 0.107220 0.216130 2.095415
## Minimum 0.000000 0.000000 0.000000 0.000000 0.328580
## Maximum 1.025250 0.669730 0.551910 0.795880 3.602140
Our data looks quite clean. There are also no missing values in the dataset, so we don’t need to spend any time handling them, which speeds up the analysis.
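The single total from `sum(is.na(...))` does not say where missing values sit when there are some. A minimal sketch of a per-column check follows, on a made-up toy data frame with `NA` values inserted on purpose; `toy`, `na_per_column` and `complete_rows` are illustrative names.

```r
# Toy data with two deliberately missing values (not real observations)
toy <- data.frame(
  country = c("A", "B", "C"),
  happiness_score = c(7.5, NA, 3.2),
  freedom = c(0.6, 0.4, NA),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with no missing value at all
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```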
I also want to count how many countries and regions we have:
#Number of countries in our dataset
length(unique(happy_2015$country))
## [1] 158
#Number of regions
length(unique(happy_2015$region))
## [1] 10
Since we have 10 regions and 158 countries, we can subset the dataset by region and look at the distributions. By analyzing regions separately, we can learn about their characteristics. But before zooming into the regions (subsetting), we can have a look at the world in general.
Rank 1 indicates the happiest country in the world. Keeping that in mind, we can look at the first ten and the last ten countries in the list. Top ten happiest countries:
kable(head(happy_2015, 10)) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| country | region | happiness_rank | happiness_score | std_error | gdp_per_cpt | family | life_exp | freedom | trust_corruption | generosity | dystopia_residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |
| Norway | Western Europe | 4 | 7.522 | 0.03880 | 1.45900 | 1.33095 | 0.88521 | 0.66973 | 0.36503 | 0.34699 | 2.46531 |
| Canada | North America | 5 | 7.427 | 0.03553 | 1.32629 | 1.32261 | 0.90563 | 0.63297 | 0.32957 | 0.45811 | 2.45176 |
| Finland | Western Europe | 6 | 7.406 | 0.03140 | 1.29025 | 1.31826 | 0.88911 | 0.64169 | 0.41372 | 0.23351 | 2.61955 |
| Netherlands | Western Europe | 7 | 7.378 | 0.02799 | 1.32944 | 1.28017 | 0.89284 | 0.61576 | 0.31814 | 0.47610 | 2.46570 |
| Sweden | Western Europe | 8 | 7.364 | 0.03157 | 1.33171 | 1.28907 | 0.91087 | 0.65980 | 0.43844 | 0.36262 | 2.37119 |
| New Zealand | Australia and New Zealand | 9 | 7.286 | 0.03371 | 1.25018 | 1.31967 | 0.90837 | 0.63938 | 0.42922 | 0.47501 | 2.26425 |
| Australia | Australia and New Zealand | 10 | 7.284 | 0.04083 | 1.33358 | 1.30923 | 0.93156 | 0.65124 | 0.35637 | 0.43562 | 2.26646 |
The happiest countries are mostly in Western Europe.
Ten least happy countries:
kable(tail(happy_2015, 10)) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| | country | region | happiness_rank | happiness_score | std_error | gdp_per_cpt | family | life_exp | freedom | trust_corruption | generosity | dystopia_residual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 149 | Chad | Sub-Saharan Africa | 149 | 3.667 | 0.03830 | 0.34193 | 0.76062 | 0.15010 | 0.23501 | 0.05269 | 0.18386 | 1.94296 |
| 150 | Guinea | Sub-Saharan Africa | 150 | 3.656 | 0.03590 | 0.17417 | 0.46475 | 0.24009 | 0.37725 | 0.12139 | 0.28657 | 1.99172 |
| 151 | Ivory Coast | Sub-Saharan Africa | 151 | 3.655 | 0.05141 | 0.46534 | 0.77115 | 0.15185 | 0.46866 | 0.17922 | 0.20165 | 1.41723 |
| 152 | Burkina Faso | Sub-Saharan Africa | 152 | 3.587 | 0.04324 | 0.25812 | 0.85188 | 0.27125 | 0.39493 | 0.12832 | 0.21747 | 1.46494 |
| 153 | Afghanistan | Southern Asia | 153 | 3.575 | 0.03084 | 0.31982 | 0.30285 | 0.30335 | 0.23414 | 0.09719 | 0.36510 | 1.95210 |
| 154 | Rwanda | Sub-Saharan Africa | 154 | 3.465 | 0.03464 | 0.22208 | 0.77370 | 0.42864 | 0.59201 | 0.55191 | 0.22628 | 0.67042 |
| 155 | Benin | Sub-Saharan Africa | 155 | 3.340 | 0.03656 | 0.28665 | 0.35386 | 0.31910 | 0.48450 | 0.08010 | 0.18260 | 1.63328 |
| 156 | Syria | Middle East and Northern Africa | 156 | 3.006 | 0.05015 | 0.66320 | 0.47489 | 0.72193 | 0.15684 | 0.18906 | 0.47179 | 0.32858 |
| 157 | Burundi | Sub-Saharan Africa | 157 | 2.905 | 0.08658 | 0.01530 | 0.41587 | 0.22396 | 0.11850 | 0.10062 | 0.19727 | 1.83302 |
| 158 | Togo | Sub-Saharan Africa | 158 | 2.839 | 0.06727 | 0.20868 | 0.13995 | 0.28443 | 0.36453 | 0.10731 | 0.16681 | 1.56726 |
The majority of the unhappiest countries are located in Sub-Saharan Africa. The two countries on this list outside Sub-Saharan Africa are Syria and Afghanistan, which is not surprising if we consider the ongoing war and terrorism in those countries.
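The `head()` and `tail()` calls above give the happiest and unhappiest countries only because the file is already sorted by rank. A small sketch that makes the ordering explicit, using four rows from the tables above in shuffled order (`toy`, `by_rank`, `top_2` and `bottom_2` are illustrative names):

```r
# Four rows from the 2015 tables, deliberately out of order
toy <- data.frame(
  country = c("Togo", "Switzerland", "Burundi", "Iceland"),
  happiness_rank = c(158, 1, 157, 2),
  happiness_score = c(2.839, 7.587, 2.905, 7.561),
  stringsAsFactors = FALSE
)

# Sort by rank first, so head()/tail() do not depend on file order
by_rank <- toy[order(toy$happiness_rank), ]

top_2 <- head(by_rank, 2)      # happiest
bottom_2 <- tail(by_rank, 2)   # unhappiest

print(top_2)
print(bottom_2)
```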
Since happiness rank is based on happiness score, I would like to analyze the average happiness score among regions and then try to focus on the characteristics of the happiest and unhappiest regions.
library(dplyr)
library(tidyr)
library(plotly)
#table with average happiness per region
avg_happiness_region <- happy_2015 %>%
  group_by(region) %>%
  #round() must wrap mean(); mean(x, round(1)) passes trim = 1, which returns the median
  summarise(avg_happiness = round(mean(happiness_score), 1))
#Plotting the average happiness scores to compare regions
p_avg_happiness_region <- plot_ly(avg_happiness_region, x = ~region,
y = ~avg_happiness,
type = 'bar',
name = 'Average Happiness') %>%
#add_trace(y = ~mean(happy_2015_copy$happiness_score), name = 'world')%>%
layout(title="Average Happiness per Region in 2015")
htmltools::tagList(list(p_avg_happiness_region))
The top 3 happiest regions based on average happiness score are:
- Australia & New Zealand
- North America
- Western Europe

It is important to note that the first two regions include only 2 countries each, while Western Europe has 21 countries. Additionally, all of these regions include countries with developed economies.
The unhappiest region is Sub-Saharan Africa, which includes 40 different countries.
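Because region sizes differ so much, it can help to show the number of countries next to each regional average. A base-R sketch follows, using five rows from the top-ten table above; since it uses only a handful of rows, the averages and their order are not those of the full dataset. `toy`, `avg`, `n` and `region_summary` are illustrative names.

```r
# Five rows taken from the top-ten table above
toy <- data.frame(
  region = c("Western Europe", "Western Europe", "North America",
             "Australia and New Zealand", "Australia and New Zealand"),
  happiness_score = c(7.587, 7.561, 7.427, 7.286, 7.284),
  stringsAsFactors = FALSE
)

# Average score and number of countries per region
avg <- tapply(toy$happiness_score, toy$region, mean)
n <- table(toy$region)

region_summary <- data.frame(
  region = names(avg),
  avg_happiness = round(as.numeric(avg), 3),
  n_countries = as.integer(n[names(avg)]),
  stringsAsFactors = FALSE
)

# Sort from happiest to unhappiest region
region_summary <- region_summary[order(-region_summary$avg_happiness), ]
print(region_summary)
```

An average over two countries is much more sensitive to a single value than an average over 21 or 40, so the count column is worth reading alongside the bar chart.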
Now we can create a correlogram to analyze the relationships. To do that we need only numeric columns. So I select the numeric columns only and create a dataframe called num_hap.
#names(happy_2015)
num_hap <- happy_2015[, c("happiness_rank", "happiness_score", "std_error", "gdp_per_cpt", "family", "life_exp", "freedom", "trust_corruption", "generosity", "dystopia_residual")]
library(corrplot)
source("http://www.sthda.com/upload/rquery_cormat.r")
rquery.cormat(num_hap)
## $r
## happiness_rank std_error generosity freedom
## happiness_rank 1
## std_error 0.16 1
## generosity -0.16 -0.088 1
## freedom -0.56 -0.13 0.37 1
## trust_corruption -0.37 -0.18 0.28 0.49
## dystopia_residual -0.52 0.084 -0.1 0.063
## gdp_per_cpt -0.79 -0.22 -0.01 0.37
## life_exp -0.74 -0.31 0.11 0.36
## happiness_score -0.99 -0.18 0.18 0.57
## family -0.73 -0.12 0.088 0.44
## trust_corruption dystopia_residual gdp_per_cpt life_exp
## happiness_rank
## std_error
## generosity
## freedom
## trust_corruption 1
## dystopia_residual -0.033 1
## gdp_per_cpt 0.31 0.04 1
## life_exp 0.25 0.019 0.82 1
## happiness_score 0.4 0.53 0.78 0.72
## family 0.21 0.15 0.65 0.53
## happiness_score family
## happiness_rank
## std_error
## generosity
## freedom
## trust_corruption
## dystopia_residual
## gdp_per_cpt
## life_exp
## happiness_score 1
## family 0.74 1
##
## $p
## happiness_rank std_error generosity freedom
## happiness_rank 0
## std_error 0.047 0
## generosity 0.044 0.27 0
## freedom 3e-14 0.1 1.3e-06 0
## trust_corruption 1.5e-06 0.025 0.00045 4.4e-11
## dystopia_residual 2e-12 0.29 0.21 0.43
## gdp_per_cpt 2.7e-34 0.006 0.9 1.7e-06
## life_exp 3.5e-28 7.3e-05 0.18 3.3e-06
## happiness_score 1.4e-142 0.026 0.023 6.9e-15
## family 5.8e-28 0.13 0.27 6.4e-09
## trust_corruption dystopia_residual gdp_per_cpt life_exp
## happiness_rank
## std_error
## generosity
## freedom
## trust_corruption 0
## dystopia_residual 0.68 0
## gdp_per_cpt 8.3e-05 0.62 0
## life_exp 0.0017 0.81 4.8e-39 0
## happiness_score 2.8e-07 7.6e-13 1.1e-33 5.8e-27
## family 0.0096 0.063 5.6e-20 7e-13
## happiness_score family
## happiness_rank
## std_error
## generosity
## freedom
## trust_corruption
## dystopia_residual
## gdp_per_cpt
## life_exp
## happiness_score 0
## family 9.9e-29 0
##
## $sym
## happiness_rank std_error generosity freedom
## happiness_rank 1
## std_error 1
## generosity 1
## freedom . . 1
## trust_corruption . .
## dystopia_residual .
## gdp_per_cpt , .
## life_exp , . .
## happiness_score B .
## family , .
## trust_corruption dystopia_residual gdp_per_cpt life_exp
## happiness_rank
## std_error
## generosity
## freedom
## trust_corruption 1
## dystopia_residual 1
## gdp_per_cpt . 1
## life_exp + 1
## happiness_score . . , ,
## family , .
## happiness_score family
## happiness_rank
## std_error
## generosity
## freedom
## trust_corruption
## dystopia_residual
## gdp_per_cpt
## life_exp
## happiness_score 1
## family , 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
A heatmap is another way to show the correlations, but I prefer the correlogram, as I find it easier to read.
cormat<-rquery.cormat(num_hap, graphType="heatmap")
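The same kind of lower-triangle table can also be computed locally with base R, without sourcing a script from a URL. This is a sketch on five rows and three columns from the top-ten table, so the coefficients differ from the full-data output above; `toy`, `cor_mat` and `lower` are illustrative names.

```r
# First five rows of the 2015 data, three numeric columns only
toy <- data.frame(
  happiness_score = c(7.587, 7.561, 7.527, 7.522, 7.427),
  gdp_per_cpt = c(1.39651, 1.30232, 1.32548, 1.45900, 1.32629),
  family = c(1.34951, 1.40223, 1.36058, 1.33095, 1.32261)
)

# Pearson correlation matrix, rounded for display
cor_mat <- round(cor(toy), 2)

# Blank out the upper triangle so each pair appears once
lower <- cor_mat
lower[upper.tri(lower)] <- NA
print(lower, na.print = "")
```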
Now we can focus on the attributes that have strong correlation with happiness score.
Strong correlations will give me an idea about which factors were more related to happiness score. It is always important to note that correlation and causation are different things!
cor(happy_2015$happiness_score, happy_2015$gdp_per_cpt)
## [1] 0.7809655
Countries that have a higher GDP per capita seem to be happier. Since a higher GDP per capita indicates a better standard of living for a country, this makes sense.
ggplot(happy_2015, aes(x = gdp_per_cpt, y = happiness_score)) +
  geom_point(aes(color = region)) +
  geom_smooth(method = "lm") +
  xlab("GDP per Capita") +
  ylab("Happiness Score") +
  labs(colour = "Region")
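`cor()` returns only a point estimate. As a sketch, `cor.test()` from base R's stats package additionally reports a confidence interval and a p-value for the same coefficient. The example uses the ten rows of the top-ten table above, so the numbers are not those of the full dataset; `toy` and `result` are illustrative names.

```r
# Happiness score and GDP per capita for the ten happiest countries
toy <- data.frame(
  happiness_score = c(7.587, 7.561, 7.527, 7.522, 7.427,
                      7.406, 7.378, 7.364, 7.286, 7.284),
  gdp_per_cpt = c(1.39651, 1.30232, 1.32548, 1.45900, 1.32629,
                  1.29025, 1.32944, 1.33171, 1.25018, 1.33358)
)

# Pearson correlation with confidence interval and p-value
result <- cor.test(toy$happiness_score, toy$gdp_per_cpt, method = "pearson")

print(result$estimate)   # same value as cor()
print(result$conf.int)   # 95% confidence interval by default
print(result$p.value)
```

With few observations the interval is wide, which is a useful reminder when comparing correlations computed on small regional subsets.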
cor(happy_2015$happiness_score, happy_2015$family)
## [1] 0.7406052
Although the metadata does not clearly explain the family attribute, it is strongly and positively correlated with the happiness score.
ggplot(happy_2015, aes(x = family, y = happiness_score)) +
  geom_point(aes(color = region)) +
  geom_smooth(method = "lm") +
  xlab("Family") +
  ylab("Happiness Score") +
  labs(colour = "Region")
cor(happy_2015$happiness_score, happy_2015$life_exp)
## [1] 0.7241996
Life expectancy, which probably reflects the average number of years lived, is also positively correlated with the happiness score.
ggplot(happy_2015, aes(x = life_exp, y = happiness_score)) +
  geom_point(aes(color = region)) +
  geom_smooth(method = "lm") +
  xlab("Life Expectancy") +
  ylab("Happiness Score") +
  labs(colour = "Region")
cor(happy_2015$happiness_score, happy_2015$freedom)
## [1] 0.5682109
Freedom is not as strongly related to the happiness score as life expectancy, GDP per capita and family are.
To compare regions, or to analyze each region separately in the future, I subset the dataset by region.
## Subsetting our dataset
Once the dataset is subsetted into regions, we can analyze each region.
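Rather than calling `cor()` once per attribute, the coefficients can be computed in one pass and sorted. The sketch below uses the ten rows of the top-ten table and three attributes, so the ranking it prints applies to that small subset only, not to all 158 countries; `toy`, `predictors`, `cors` and `cors_sorted` are illustrative names.

```r
# Ten happiest countries, with three candidate attributes
toy <- data.frame(
  happiness_score = c(7.587, 7.561, 7.527, 7.522, 7.427,
                      7.406, 7.378, 7.364, 7.286, 7.284),
  gdp_per_cpt = c(1.39651, 1.30232, 1.32548, 1.45900, 1.32629,
                  1.29025, 1.32944, 1.33171, 1.25018, 1.33358),
  family = c(1.34951, 1.40223, 1.36058, 1.33095, 1.32261,
             1.31826, 1.28017, 1.28907, 1.31967, 1.30923),
  freedom = c(0.66557, 0.62877, 0.64938, 0.66973, 0.63297,
              0.64169, 0.61576, 0.65980, 0.63938, 0.65124)
)

# Correlation of every other column with the happiness score
predictors <- setdiff(names(toy), "happiness_score")
cors <- sapply(predictors, function(col) cor(toy$happiness_score, toy[[col]]))

# Strongest positive correlation first
cors_sorted <- sort(cors, decreasing = TRUE)
print(round(cors_sorted, 2))
```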
unique(happy_2015$region)
## [1] "Western Europe" "North America"
## [3] "Australia and New Zealand" "Middle East and Northern Africa"
## [5] "Latin America and Caribbean" "Southeastern Asia"
## [7] "Central and Eastern Europe" "Eastern Asia"
## [9] "Sub-Saharan Africa" "Southern Asia"
#############################
#Happiest Regions
#############################
#Australia & New Zealand
aust_newzealand <- happy_2015[which(happy_2015$region == "Australia and New Zealand"), ]
#Subsetting Western Europe
w_europe <- happy_2015[which(happy_2015$region == "Western Europe"), ]
#North America
n_america <- happy_2015[which(happy_2015$region == "North America"), ]
#Happiest regions Altogether
happy_regions <- rbind(aust_newzealand, w_europe, n_america)
#############################
# Unhappiest Region
#############################
#Sub-Saharan Africa
sub_saharan_africa <- happy_2015[which(happy_2015$region == "Sub-Saharan Africa"), ]
#############################
# Other Regions
#############################
#Latin America & Caribbean
l_america <- happy_2015[which(happy_2015$region == "Latin America and Caribbean"), ]
#Middle East and Northern Africa
m_east_n_africa <- happy_2015[which(happy_2015$region == "Middle East and Northern Africa"), ]
#Central and Eastern Europe
central_easteu <- happy_2015[which(happy_2015$region == "Central and Eastern Europe"), ]
#Eastern Asia
east_asia <- happy_2015[which(happy_2015$region == "Eastern Asia"), ]
#Southern Asia
south_asia <- happy_2015[which(happy_2015$region == "Southern Asia"), ]
After combining the happiest regions (Australia & New Zealand, Western Europe and North America), we have 25 countries.
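The region-by-region subsetting above can also be written more compactly. As a sketch, `split()` returns a named list with one data frame per region, and `%in%` selects several regions at once. The toy rows come from the tables above; `toy`, `by_region`, `happiest` and `happy_regions_toy` are illustrative names.

```r
# Five rows from the 2015 tables, covering four regions
toy <- data.frame(
  country = c("Switzerland", "Canada", "New Zealand", "Australia", "Togo"),
  region = c("Western Europe", "North America", "Australia and New Zealand",
             "Australia and New Zealand", "Sub-Saharan Africa"),
  happiness_score = c(7.587, 7.427, 7.286, 7.284, 2.839),
  stringsAsFactors = FALSE
)

# One data frame per region, accessible by region name
by_region <- split(toy, toy$region)
print(names(by_region))
print(by_region[["Australia and New Zealand"]])

# Several regions at once, equivalent to rbind() of the separate subsets
happiest <- c("Australia and New Zealand", "Western Europe", "North America")
happy_regions_toy <- toy[toy$region %in% happiest, ]
print(nrow(happy_regions_toy))
```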
library(ggplot2)
#subsetting numeric columns for drawing a histogram
num_hap_regions <- happy_regions[, c("happiness_rank", "happiness_score", "std_error", "gdp_per_cpt", "family", "life_exp", "freedom", "trust_corruption", "generosity", "dystopia_residual")]
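#iterating through every column and plotting a histogram
column_names <- colnames(num_hap_regions)
for (i in seq_along(column_names)) {
  #print() is required: ggplot objects are not auto-printed inside a for loop
  print(ggplot(num_hap_regions, aes(x = .data[[column_names[i]]])) +
    geom_histogram(fill = "gray", bins = 30) +
    xlab(column_names[i]))
}
#par(mfrow = ...) only affects base graphics, not ggplot2, so it is dropped here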
I didn’t want to display the histograms individually, as I want to zoom in on some attributes specifically. However, you can run the code and look at the distributions.
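If all histograms should appear on one page, base graphics are one option, since `par(mfrow = ...)` does apply to `hist()`. The sketch below is an alternative to the ggplot2 loop, not the method used in this analysis. It uses the ten rows of the top-ten table and four columns; `toy`, `old_par` and `hists` are illustrative names, and `pdf(NULL)` is only there so the sketch also runs without a display.

```r
# Ten happiest countries, four numeric columns
toy <- data.frame(
  happiness_score = c(7.587, 7.561, 7.527, 7.522, 7.427,
                      7.406, 7.378, 7.364, 7.286, 7.284),
  gdp_per_cpt = c(1.39651, 1.30232, 1.32548, 1.45900, 1.32629,
                  1.29025, 1.32944, 1.33171, 1.25018, 1.33358),
  freedom = c(0.66557, 0.62877, 0.64938, 0.66973, 0.63297,
              0.64169, 0.61576, 0.65980, 0.63938, 0.65124),
  generosity = c(0.29678, 0.43630, 0.34139, 0.34699, 0.45811,
                 0.23351, 0.47610, 0.36262, 0.47501, 0.43562)
)

pdf(NULL)                          # discard output; remove to see the plots
old_par <- par(mfrow = c(2, 2))    # 2 x 2 grid of base-graphics panels

# One histogram per column; hist() also returns the bin counts
hists <- lapply(names(toy), function(col) {
  hist(toy[[col]], main = col, xlab = col, col = "gray")
})

par(old_par)                       # restore the previous layout
invisible(dev.off())
```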
num_hap_regions <- happy_regions[, c("happiness_rank", "happiness_score", "std_error", "gdp_per_cpt", "family", "life_exp", "freedom", "trust_corruption", "generosity", "dystopia_residual")]
rquery.cormat(num_hap_regions)
## $r
## happiness_score dystopia_residual gdp_per_cpt
## happiness_score 1
## dystopia_residual 0.85 1
## gdp_per_cpt 0.69 0.53 1
## trust_corruption 0.81 0.56 0.61
## generosity 0.61 0.28 0.28
## family 0.82 0.6 0.47
## freedom 0.82 0.47 0.56
## life_exp -0.0076 0.044 -0.048
## happiness_rank -0.99 -0.82 -0.69
## std_error -0.55 -0.38 -0.39
## trust_corruption generosity family freedom life_exp
## happiness_score
## dystopia_residual
## gdp_per_cpt
## trust_corruption 1
## generosity 0.46 1
## family 0.59 0.52 1
## freedom 0.73 0.63 0.77 1
## life_exp -0.13 -0.1 0.041 -0.12 1
## happiness_rank -0.79 -0.62 -0.86 -0.83 -0.039
## std_error -0.56 -0.3 -0.61 -0.47 0.11
## happiness_rank std_error
## happiness_score
## dystopia_residual
## gdp_per_cpt
## trust_corruption
## generosity
## family
## freedom
## life_exp
## happiness_rank 1
## std_error 0.59 1
##
## $p
## happiness_score dystopia_residual gdp_per_cpt
## happiness_score 0
## dystopia_residual 6.8e-08 0
## gdp_per_cpt 0.00013 0.0066 0
## trust_corruption 1e-06 0.004 0.0013
## generosity 0.0012 0.17 0.17
## family 4.1e-07 0.0015 0.017
## freedom 5.2e-07 0.017 0.0036
## life_exp 0.97 0.84 0.82
## happiness_rank 3.6e-22 4.9e-07 0.00012
## std_error 0.0042 0.058 0.054
## trust_corruption generosity family freedom life_exp
## happiness_score
## dystopia_residual
## gdp_per_cpt
## trust_corruption 0
## generosity 0.02 0
## family 0.0019 0.0075 0
## freedom 2.9e-05 0.00074 7.7e-06 0
## life_exp 0.54 0.62 0.85 0.57 0
## happiness_rank 3.3e-06 0.00089 5.1e-08 3.6e-07 0.85
## std_error 0.0038 0.14 0.0012 0.018 0.6
## happiness_rank std_error
## happiness_score
## dystopia_residual
## gdp_per_cpt
## trust_corruption
## generosity
## family
## freedom
## life_exp
## happiness_rank 0
## std_error 0.0019 0
##
## $sym
## happiness_score dystopia_residual gdp_per_cpt
## happiness_score 1
## dystopia_residual + 1
## gdp_per_cpt , . 1
## trust_corruption + . ,
## generosity ,
## family + . .
## freedom + . .
## life_exp
## happiness_rank B + ,
## std_error . . .
## trust_corruption generosity family freedom life_exp
## happiness_score
## dystopia_residual
## gdp_per_cpt
## trust_corruption 1
## generosity . 1
## family . . 1
## freedom , , , 1
## life_exp 1
## happiness_rank , , + +
## std_error . , .
## happiness_rank std_error
## happiness_score
## dystopia_residual
## gdp_per_cpt
## trust_corruption
## generosity
## family
## freedom
## life_exp
## happiness_rank 1
## std_error . 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
library(tmap)
ggplot(happy_regions, aes(x = gdp_per_cpt, y = happiness_score)) +
  geom_point(aes(color = region)) +
  geom_smooth(method = "lm") +
  #note: scale limits drop points outside the range before the lm fit is computed
  scale_x_continuous(limits = c(1.2, 1.58)) +
  scale_y_continuous(limits = c(5.5, 8.5)) +
  ggtitle("Happiest Regions: Happiness Score & GDP per Capita") +
  xlab("GDP per Capita") +
  ylab("Happiness Score") +
  labs(colour = "Region")
ggplot(sub_saharan_africa, aes(x = gdp_per_cpt, y = happiness_score)) +
  geom_point(color = "orange") +
  geom_smooth(method = "lm") +
  scale_y_continuous(limits = c(2.5, 6)) +
  xlab("GDP per Capita") +
  ylab("Happiness Score") +
  ggtitle("Sub-Saharan Africa: Happiness Score & GDP per Capita")
ggplot(happy_regions, aes(x = trust_corruption, y = happiness_score)) +
  geom_point(aes(color = region)) +
  geom_smooth(method = "lm") +
  ggtitle("Happiest Regions: Happiness Score vs. Government Trust") +
  xlab("Gov. Trust") +
  ylab("Happiness Score") +
  labs(colour = "Region")
ggplot(happy_regions, aes(x = freedom, y = happiness_score)) +
  geom_point(aes(color = region)) +
  geom_smooth(method = "lm") +
  xlab("Freedom") +
  ylab("Happiness Score") +
  ggtitle("Happiest Regions: Happiness Score & Freedom") +
  labs(colour = "Region")
ggplot(sub_saharan_africa, aes(x = freedom, y = happiness_score)) +
  geom_point(aes(color = region)) +
  geom_smooth(method = "lm") +
  xlab("Freedom") +
  ylab("Happiness Score") +
  ggtitle("Sub-Saharan Africa: Happiness Score vs. Freedom") +
  labs(colour = "Region")
Freedom is more strongly correlated with happiness score in the happiest regions (0.82 in the matrix above) than it appears to be in Sub-Saharan Africa. The relationship with GDP per capita is also weak for the unhappiest region, compared with 0.78 for the whole world, which we can confirm by computing the correlation:
cor(sub_saharan_africa$happiness_score, sub_saharan_africa$gdp_per_cpt)
## [1] 0.3454735
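To compare regions side by side, the same coefficient can be computed for every region in one call. The sketch below does this for happiness score and freedom, using only the rows that appear in the top-ten and bottom-ten tables above (seven Western European and eight Sub-Saharan countries), so the printed values are not the correlations for the full regions. `toy` and `cor_by_region` are illustrative names.

```r
# Rows taken from the top-ten and bottom-ten tables above
toy <- data.frame(
  region = c(rep("Western Europe", 7), rep("Sub-Saharan Africa", 8)),
  happiness_score = c(7.587, 7.561, 7.527, 7.522, 7.406, 7.378, 7.364,
                      3.667, 3.656, 3.655, 3.587, 3.465, 3.340, 2.905, 2.839),
  freedom = c(0.66557, 0.62877, 0.64938, 0.66973, 0.64169, 0.61576, 0.65980,
              0.23501, 0.37725, 0.46866, 0.39493, 0.59201, 0.48450, 0.11850, 0.36453),
  stringsAsFactors = FALSE
)

# One Pearson correlation per region
cor_by_region <- sapply(split(toy, toy$region), function(d) {
  cor(d$happiness_score, d$freedom)
})
print(round(cor_by_region, 2))
```

Replacing `freedom` with another column name gives the same comparison for any other attribute.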